fix: preserve features in IterableDataset.add_column (fixes #5752) by IgnazioDS · Pull Request #8115 · huggingface/datasets

IgnazioDS · 2026-04-02T14:48:19Z

Summary

IterableDataset.add_column() lost the dataset's feature schema on every call because it delegated to map() without passing the features argument. map() unconditionally writes info.features = features, so the default None silently overwrote the existing schema.

Root cause (iterable_dataset.py:3459):

info = self.info.copy()
info.features = features  # None when add_column() doesn't pass features
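
The effect of that unconditional assignment can be reproduced outside the library. The sketch below is a toy model only: `ToyInfo`, `ToyIterableDataset`, and `add_column_buggy` are hypothetical stand-ins written for illustration, not the real `datasets` classes, and the schema is reduced to a plain dict.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ToyInfo:
    # Hypothetical stand-in for the dataset info object.
    # `features` maps column name -> type name, or is None when unknown.
    features: Optional[dict] = None


class ToyIterableDataset:
    """Hypothetical, heavily simplified stand-in for IterableDataset."""

    def __init__(self, info: ToyInfo):
        self.info = info

    def map(self, function=None, features: Optional[dict] = None):
        info = replace(self.info)  # shallow copy, playing the role of info.copy()
        info.features = features   # unconditional write: a None default erases the schema
        return ToyIterableDataset(info)

    def add_column_buggy(self, name, column):
        # Mirrors the pre-fix behaviour: `features` is never forwarded to map().
        return self.map(lambda example: example)


ds = ToyIterableDataset(ToyInfo(features={"text": "string"}))
out = ds.add_column_buggy("label", [0, 1])

print(ds.info.features)   # {'text': 'string'}
print(out.info.features)  # None
```

The source dataset keeps its schema because `map()` works on a copy; only the returned dataset loses it.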

Fix: infer the new column's Arrow type via the existing _infer_features_from_batch helper, merge it with the existing features, and pass the result to map().

new_features = None
if self._info.features is not None:
    column_features = _infer_features_from_batch({name: list(column)})
    new_features = Features({**self._info.features, **column_features})
return self.map(..., features=new_features)

If the dataset had no features to begin with (_info.features is None), behaviour is unchanged — map() receives None as before.
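
The merge step can be sketched with plain dicts standing in for `Features`. `merge_features` is a hypothetical helper written for this illustration and the type names are placeholder strings; it only shows the two branches described above and how dict unpacking resolves keys.

```python
def merge_features(existing, column_features):
    """Toy version of the merge step; plain dicts stand in for Features."""
    if existing is None:
        # No schema to preserve: keep passing None, as before the fix.
        return None
    # With dict unpacking, later keys win and first-insertion order is kept.
    return {**existing, **column_features}


with_schema = merge_features({"text": "string"}, {"label": "int64"})
without_schema = merge_features(None, {"label": "int64"})

print(with_schema)     # {'text': 'string', 'label': 'int64'}
print(without_schema)  # None
```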

Test plan

Reproduce the original issue: create a streaming dataset with known features, call .add_column(), assert .features is not None and contains the new column
Verify existing test_iterable_dataset.py tests still pass
Smoke-test with numpy array column input (not just list)

…ce#5752) map() unconditionally sets info.features to its `features` parameter, which defaults to None. add_column() called map() without passing features, so every add_column() call wiped out the existing schema. Fix: infer the new column's Arrow type and merge it with the existing features before delegating to map(), so the returned dataset retains the full schema including the newly added column.

lhoestq

lgtm ! I just added a suggestion, lmk what you think before we merge

lhoestq · 2026-04-07T14:16:22Z

+        # Without this, map() would set info.features=None (its default), losing all schema info.
+        new_features = None
+        if self._info.features is not None:
+            column_features = _infer_features_from_batch({name: list(column)})


this could help avoid unnecessary data copies, since _infer_features_from_batch does copy all the data into arrow

Suggested change

-            column_features = _infer_features_from_batch({name: list(column)})
+            column_features = _infer_features_from_batch({name: list(column[:config.DEFAULT_MAX_BATCH_SIZE])})
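
A rough sketch of why the slice helps, assuming the column is homogeneous: inferring the type from a bounded prefix gives the same answer as scanning the whole column while touching far fewer elements. `MAX_BATCH_SIZE` and `infer_type_name` are hypothetical stand-ins for `config.DEFAULT_MAX_BATCH_SIZE` and `_infer_features_from_batch`; the real helper works on Arrow data, not Python type names.

```python
MAX_BATCH_SIZE = 1000  # hypothetical stand-in for config.DEFAULT_MAX_BATCH_SIZE


def infer_type_name(values):
    """Toy inference: return the single Python type name shared by all values."""
    names = {type(v).__name__ for v in values}
    if len(names) != 1:
        raise ValueError(f"mixed or empty column: {sorted(names)}")
    return names.pop()


column = list(range(1_000_000))

full = infer_type_name(column)            # scans every element
prefix = column[:MAX_BATCH_SIZE]          # slicing works for lists and numpy arrays alike
sampled = infer_type_name(prefix)         # scans at most MAX_BATCH_SIZE elements

print(full, sampled, len(prefix))  # int int 1000
```

In this toy, the prefix and the full column only agree because every value has the same type; a column whose type changes after the first batch would be inferred from the prefix alone.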

HuggingFaceDocBuilderDev · 2026-04-07T14:19:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq approved these changes Apr 7, 2026

IgnazioDS wants to merge 1 commit into huggingface:main from IgnazioDS:fix/streaming-add-column-loses-features
